By tokenizing, you can conveniently split up text by word or by sentence.
Tokenizing by word: words are like the atoms of natural language.
word_tokenize(text: str) -> list
Tokenizing by sentence (context-based results): analyze how those words relate to one another and see more context.
sent_tokenize(text: str) -> list
from nltk.tokenize import sent_tokenize, word_tokenize #tokenization
from nltk.corpus import stopwords
stop_words: list = stopwords.words("english")
Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. You filter out stop words because they don’t add much meaning to a text in and of themselves.
Examples: 'in', 'is', and 'an'
stopwords.words('english')
includes only lowercase versions of stop words
tokens: list = word_tokenize("Hello there beautiful")
stop_words: list = stopwords.words("english")
# ^-- only lowercase versions of stop words...
filtered_list = [  # ^-- so you must `.lower()` or `.casefold()` (casefold is a slightly more aggressive lower) the `tokens`
    token for token in tokens if token.casefold() not in stop_words
]
NLTK has more than one stemmer, but you’ll be using the Porter stemmer.
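A minimal Porter stemmer sketch (it is rule-based, so no corpus download is needed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # 'run'
print(stemmer.stem("runs"))     # 'run'
print(stemmer.stem("ponies"))   # 'poni' -- stems are not always dictionary words
```

That last output is why lemmatization (below) is often preferable: it returns real words.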
Basically, tagging parts of speech before lemmatizing makes the results more accurate.
EXAMPLE:
lemmatizer.lemmatize("worst") # 'worst' -- with no POS given, it assumed "worst" was a noun
lemmatizer.lemmatize("worst", pos="a") # 'bad' -- correct, since "worst" is an adjective in the context of the sentence
CODE COPY PASTE #2:
import nltk
from nltk.tokenize import word_tokenize
nltk.download("averaged_perceptron_tagger", quiet=True)      # POS tagger model
nltk.download("averaged_perceptron_tagger_eng", quiet=True)  # resource name used by newer NLTK releases
tokens = word_tokenize("Let's stop learning technology and just start planting. Food will always be in demand")
tokens: list[tuple] = nltk.pos_tag(tokens)
For the meaning of each tag: https://realpython.com/nltk-nlp-python/#tagging-parts-of-speech:~:text=The%20list%20is,Show/Hide
Reference CODE COPY PASTE #2 above — this continues from its `tokens`.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Function to map POS tags from Penn Treebank to WordNet
# (I recommend removing stop words first)
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN  # default
# Lemmatize the sentence using the mapped POS tags
lemmatized_words = [
    lemmatizer.lemmatize(word=token, pos=penn_to_wordnet(tag)) for token, tag in tokens
]